One X Variable

Module 3.2: Linear Regression


Alex Cardazzi

Old Dominion University

All materials can be found at alexcardazzi.github.io.

Correlation

In the last part of the module, we estimated the correlation between GPAs and SAT scores so that we would have a tool for guidance counselors to predict a student’s SAT score from their GPA. Or, maybe we were hired by real estate agents to create a tool to predict a home’s sale price given some features about the property. Or, maybe we want to estimate the impact a government program like SNAP has on employment outcomes for citizens. How can we use correlations to inform us?

Conditional Mean

Let’s suppose today is your first day as a guidance counselor and your first student comes into your office and says, “what do you think I’ll get on my SAT?” Unfortunately, you do not know anything about them, so the best guess you can come up with is… the average.

However, if the student told you that they have a 2.8 GPA, you would likely adjust your guess downwards.1

By adjusting your guess downwards, you are implicitly calculating a conditional mean. When we know nothing, the best guess we can come up with is the mean of the entire distribution. We talked about this in Module 2. However, if you have some additional information (e.g., GPA) before making your guess, your guess can only improve. We can demonstrate this with a simple simulation.

Let’s revisit the data that contained all-star basketball player heights. If you told me that you were going to select a player at random and ask me to guess their height, the “best” number to pick would be the mean of the data. Like we did with standard deviation, let’s define how wrong we are by the squared difference of our “mistake”, or error. Let’s simulate this:

Code
library("scales")
bball <- read.csv("https://raw.githubusercontent.com/alexCardazzi/alexcardazzi.github.io/main/econ311/data/bball_allstars.csv")

# Create a function to randomly sample n times
#   and calculate the squared difference from the guess.
# The function then returns the average of all n draws.
sim <- function(my_pick, n){
  
  sqr_diff <- rep(NA, n)
  for(i in 1:n){
    
    random_player <- sample(1:nrow(bball), 1)
    sqr_diff[i] <- (my_pick - bball$HEIGHT[random_player])^2
  }
  mean(sqr_diff)
}

# I am going to come up with MANY guesses.
# I am going to start 6 inches below the mean
#   and increase by a tenth of an inch until
#   I reach 6 inches above the mean.
the_picks <- seq(mean(bball$HEIGHT) - 6, mean(bball$HEIGHT) + 6, by = 0.1)

avg_sqr_diffs <- NULL
for(pick in the_picks){
  
  avg_sqr_diffs[length(avg_sqr_diffs) + 1] <- sim(pick, 100)
}

plot(the_picks, avg_sqr_diffs, las = 1,
     xlab = "Guess for Height", pch = 19,
     ylab = "Average Squared Error",
     col = alpha("dodgerblue", 0.33))
abline(v = mean(bball$HEIGHT), col = "tomato")
legend("bottomleft", "Average Height", lty = 1, col = "tomato", bty = "n")
Plot

Average error by guess, which is a U-shape with the bottom occurring around the average height of all players.

Just as we adjusted our guess about the student’s SAT score once they told us their GPA, we would adjust our guess here if I told you I would only randomly sample from WNBA players. Let’s rerun the simulation, except only sampling from the WNBA distribution.

Code
sim <- function(my_pick, n){
  
  sqr_diff <- rep(NA, n)
  for(i in 1:n){
    
    random_player <- sample(which(bball$LEAGUE == "WNBA"), 1)
    sqr_diff[i] <- (my_pick - bball$HEIGHT[random_player])^2
  }
  mean(sqr_diff)
}

the_picks <- seq(mean(bball$HEIGHT) - 6, mean(bball$HEIGHT) + 6, by = 0.1)

avg_sqr_diffs <- NULL
for(pick in the_picks){
  
  avg_sqr_diffs[length(avg_sqr_diffs) + 1] <- sim(pick, 100)
}

plot(the_picks, avg_sqr_diffs, las = 1,
     xlab = "Guess for Height", pch = 19,
     ylab = "Average Squared Error",
     col = alpha("dodgerblue", 0.33))
abline(v = mean(bball$HEIGHT), col = "tomato")
abline(v = mean(bball$HEIGHT[bball$LEAGUE == "WNBA"]), col = "mediumseagreen", lty = 2)
legend("topleft", c("Average Height", "Average WNBA Height"),
       lty = 1:2, col = c("tomato", "mediumseagreen"), bty = "n")
Plot

Average error by guess, which is a U-shape with the bottom occurring around the average height of WNBA players.

In both cases, you can see that the points reach a (near) minimum error when the “guess” is near the mean of the respective distribution.

Keep in mind that our goal here was to minimize the sum (or mean) of squared errors. We will revisit this soon, but first, we are going to jump back to GPAs and SAT scores.
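In fact, we can confirm numerically that the mean is the squared-error-minimizing guess without any simulation at all. The sketch below uses synthetic heights (illustrative values, not the all-star data) and R’s built-in `optimize()`:

```r
# Synthetic heights for illustration (not the all-star data)
set.seed(311)
heights <- rnorm(500, mean = 74, sd = 4)

# Average squared error for a given guess
avg_sq_err <- function(guess) mean((guess - heights)^2)

# Numerically search for the error-minimizing guess
best <- optimize(avg_sq_err, interval = range(heights))
c(best$minimum, mean(heights))  # the two should be (nearly) identical
```

Whatever data you feed in, the minimizer lands on the sample mean.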

Our goal as guidance counselors is to give students our best guess for what they’ll score on their SAT. In essence, we want a function that converts a GPA into an SAT score. We want this function to have a few properties:

  1. SAT score must be dependent on GPA.
  2. SAT score should increase as GPA increases (since they’re positively correlated).
  3. Given an average GPA, the model should output an average SAT.
  4. The model should be correct on average.

Like we did with the law of demand, we are going to guess a functional form for the relationship between GPA and SAT:

\[\text{SAT}_i = \beta_0 + \beta_1 \times \text{GPA}_i\]

This functional form satisfies our first two conditions right off the bat. However, we still need to select values for \(\beta_0\) and \(\beta_1\) to satisfy the third and fourth conditions.

This equation of a line gives us the conditional expectation of SAT given GPA. This is most often written as \(E[ \ \text{SAT} \ | \ \text{GPA} \ ]\). You might say, “that’s great, but what values should we use for \(\beta_0\) and \(\beta_1\)?” Let’s pick a few different values and plot them along with the data.

Code
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = alpha("black", 0.2),
     xlab = "GPA", ylab = "SAT")
abline(a = 10, b = 30, lty = 1, col = "tomato", lwd = 2)
abline(a = 40, b = 20, lty = 2, col = "orchid", lwd = 2)
abline(a = 70, b = 10, lty = 3, col = "dodgerblue", lwd = 2)
legend("bottomright", c("B0 = 10; B1 = 30",
                        "B0 = 40; B1 = 20",
                        "B0 = 70; B1 = 10"),
       col = c("tomato", "orchid", "dodgerblue"),
       lty = 1:3, bty = "n", lwd = 2)
Plot

Scatter plot of SAT vs GPA, with three possible guesses for regression lines.

No matter what values we choose for \(\beta_0\) and \(\beta_1\), there’s never going to be a single line that runs through every one of the points. To fix this, let’s write the equation as follows:

\[\text{SAT}_i = \beta_0 + \beta_1\text{GPA}_i + \epsilon_i\] where \(\epsilon_i\) (epsilon) is our error term. This will make up the difference between the point and the line we draw through the points.

Line of Best Fit

Ideally, we wouldn’t need \(\epsilon\) at all and our model would perfectly fit the data. However, since we do need \(\epsilon\), we want to minimize it. Specifically, like we did when guessing heights of basketball players, we are going to minimize the sum of squared errors, or \(\sum \epsilon_i^2\). Note that we can calculate the errors associated with choosing \(\beta_0\) and \(\beta_1\) by the following: \(\epsilon_i = \text{SAT}_i - \beta_0 - \beta_1 \text{GPA}_i\).

Since the errors will be minimized, the line given by \(\beta_0\) and \(\beta_1\) will be colloquially known as the “line of best fit”.
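To make this concrete, here is a quick sketch that computes the sum of squared errors for each of the three candidate lines we plotted above (reusing the same dataset URL; the helper function `sse()` is my own shorthand):

```r
# Same GPA/SAT data as above
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")

# Sum of squared errors for a candidate line SAT = b0 + b1 * GPA
sse <- function(b0, b1){
  eps <- df$sat_sum - b0 - b1 * df$hs_gpa
  sum(eps^2)
}

# Compare the three guesses from the plot above
c(sse(10, 30), sse(40, 20), sse(70, 10))
```

The flattest-looking line in the plot is not necessarily the best one; the numbers settle the question.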

Ordinary Least Squares

The algorithm we are going to use to solve for \(\beta_0\) and \(\beta_1\) is called Ordinary Least Squares, or OLS. To begin, let’s start with writing an expression for the sum of squared errors.

\[S = \sum \epsilon^2= \sum (Y - \beta_0 - \beta_1X)^2 \] When we want to minimize an expression, we must take the derivative and set it equal to zero. Since \(\beta_0\) and \(\beta_1\) are unknown, these are what we need to take the derivative with respect to. First, we will focus on \(\beta_0\).


\[\frac{\partial S}{\partial \hat{\beta}_0} = \sum 2\times (Y - \hat{\beta}_0 - \hat{\beta}_1X) \times (-1) = 0\]

\[\frac{1}{-2} \times \sum 2\times (Y - \hat{\beta}_0 - \hat{\beta}_1X) \times (-1) = \frac{1}{-2} \times 0\]

\[\sum Y - \sum \hat{\beta}_0 - \sum \hat{\beta}_1X = 0\]

\[\sum Y - \hat{\beta}_1\sum X = \hat{\beta}_0 \sum 1\]

\[\sum Y - \hat{\beta}_1\sum X = n \times \hat{\beta}_0 \]

\[\frac{1}{n}\sum Y - \hat{\beta}_1\frac{1}{n}\sum X = \hat{\beta}_0 \]

\[\overline{Y} - \hat{\beta}_1 \overline{X} = \hat{\beta}_0\]

Now, how do we solve for \(\hat{\beta}_1\)? Use the same approach:

\[\frac{\partial S}{\partial \hat{\beta}_1} = \sum 2\times (Y - \hat{\beta}_0 - \hat{\beta}_1X) \times (-X) = 0\]

\[\sum(Y - \hat{\beta}_0 - \hat{\beta}_1X) \times (X) = 0\]

\[\sum XY - \hat{\beta}_0\sum X - \hat{\beta}_1 \sum X^2 = 0\]

\[\sum XY - \Big[\overline{Y} - \hat{\beta}_1 \overline{X}\Big]\sum X - \hat{\beta}_1 \sum X^2 = 0\]

\[\sum XY - \overline{Y}\sum X + \hat{\beta}_1 \overline{X}\sum X - \hat{\beta}_1 \sum X^2 = 0\]

\[\sum XY - \overline{Y}\sum X = \hat{\beta}_1 \sum X^2 - \hat{\beta}_1 \overline{X}\sum X\]

\[\sum XY - \overline{Y}\sum X = \hat{\beta}_1 \Big[\sum X^2 - \overline{X}\sum X\Big]\]

\[\frac{\sum XY - \overline{Y}\sum X}{\sum X^2 - \overline{X}\sum X} = \hat{\beta}_1\]

Dividing the numerator and the denominator by \(n\) turns these sums into the covariance of \(X\) and \(Y\) and the variance of \(X\):

\[\frac{cov(X,Y)}{var(X)} = \hat{\beta}_1\]

\[\overline{Y} - \frac{cov(X,Y)}{var(X)} \overline{X} = \hat{\beta}_0\]

As a note, we can also express \(\hat{\beta}_1\) in terms of the correlation coefficient \(\rho\):

\[\frac{cov(X,Y)}{var(X)} = \frac{cov(X,Y)}{\sigma_X\sigma_Y}\times\frac{\sigma_Y}{\sigma_X} = \rho\times\frac{\sigma_Y}{\sigma_X}\]
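We can verify this identity numerically on the GPA/SAT data (a quick sanity check, not part of the derivation):

```r
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")

# beta1 two ways: cov/var, and rho * (sd_Y / sd_X)
b1_cov <- cov(df$hs_gpa, df$sat_sum) / var(df$hs_gpa)
b1_rho <- cor(df$hs_gpa, df$sat_sum) * sd(df$sat_sum) / sd(df$hs_gpa)
all.equal(b1_cov, b1_rho)  # TRUE
```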

Okay, that was a lot of math. In fact, this is the most math you’ll see in this course. So, take a breath and grab a water.2

Now, let’s use some of this math to figure out \(\beta_0\) and \(\beta_1\) for our GPA and SAT example:

Code
cov_xy <- cov(df$hs_gpa, df$sat_sum)
var_x <- var(df$hs_gpa)
beta1 <- cov_xy / var_x
beta0 <- mean(df$sat_sum) - (beta1 * mean(df$hs_gpa))
cat("SAT =", beta0, "+", beta1, "x GPA")
Output
SAT = 66.98845 + 11.36317 x GPA

This expression gives us the expectation of a student’s SAT score given their GPA. We know for sure that this minimizes the sum of squared errors (because of all that math), but is the model correct on average?

Code
sat_expectation <- beta0 + beta1*df$hs_gpa
sat_real <- df$sat_sum
round(mean(sat_real - sat_expectation), 5)
Output
[1] 0
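As another sanity check, R’s built-in `lm()` function runs OLS for us and should recover the same intercept and slope we computed by hand:

```r
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")

# Fit SAT ~ GPA via built-in OLS
fit <- lm(sat_sum ~ hs_gpa, data = df)
coef(fit)  # matches beta0 and beta1 from above
```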

What does the model look like relative to the data?

Code
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = alpha("black", 0.2),
     xlab = "GPA", ylab = "SAT")
abline(a = 10, b = 30, lty = 1, col = "tomato", lwd = 2)
abline(a = 40, b = 20, lty = 2, col = "orchid", lwd = 2)
abline(a = 70, b = 10, lty = 3, col = "dodgerblue", lwd = 2)
abline(a = beta0, b = beta1, lty = 1, col = "gold", lwd = 4)
legend("bottomright", c("B0 = 10; B1 = 30",
                        "B0 = 40; B1 = 20",
                        "B0 = 70; B1 = 10",
                        "OLS"), ncol = 2,
       col = c("tomato", "orchid", "dodgerblue", "gold"),
       lty = c(1:3, 1), bty = "n", lwd = c(2, 2, 2, 4))
Plot

Scatter plot of GPA and SAT plus three possible guesses for line of best fit.  In addition, an overlayed OLS line.

In fact, one of our initial guesses, SAT = 70 + 10*GPA, was pretty good, but OLS found a better line.

In terms of interpretation, what does this equation mean in words?

  • \(\beta_1\) (11.36): If we increase GPA by one unit (e.g., from 2.2 to 3.2), we expect the sum of our SAT percentiles to increase by 11.36. In more abstract terms, “a one unit increase in \(X\) will change \(Y\) by \(\beta_1\).”
  • \(\beta_0\) (66.99): If someone has a GPA of 0, we expect their sum of SAT percentiles to be equal to 66.99. In more abstract terms, “the value of \(Y\) when \(X=0\).”

As long as “a one unit change” makes sense for \(X\), \(\beta_1\) will be interpretable. For \(\beta_0\) to be interpretable, \(X\) must be able to take on a reasonable value of zero. In this case, since a GPA of zero is nearly impossible, \(\beta_0\) does not have much of an interpretation.
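Putting the interpretation to work: to answer the guidance-counselor question for, say, a hypothetical student with a 2.8 GPA, we just plug into the fitted line. `predict()` does the arithmetic for us:

```r
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
fit <- lm(sat_sum ~ hs_gpa, data = df)

# Expected sum of SAT percentiles for a 2.8 GPA
# (equals beta0 + beta1 * 2.8 from the estimates above)
predict(fit, newdata = data.frame(hs_gpa = 2.8))
```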

Before moving on to other topics, check out this interactive OLS example.

Similarly, I built this little app where you can compare whatever regression line you want against the one produced by OLS. I guarantee you will never outperform OLS.

Statistical Inference

Just like how we tested if our sample mean was statistically different from some hypothesized value3 in the last module, we can test if our regression coefficients are different from hypothesized values as well.

Focusing on \(\hat{\beta}_1\), and skipping the math, we can write its standard error as follows, where \(s^2\) is the variance of the residuals:

\[se_{\beta_1} = \sqrt{\frac{s^2}{\sum(x_i - \overline{x})^2}}\]

To test if your estimate of \(\beta_i\) is different from another number, we will first generate a test statistic:

\[t = \frac{\beta_i - \#}{se_{\beta_i}}\]

Generally, people choose 0 for \(\#\), since a coefficient of 0 would imply that the regressor (e.g., GPA) has no relationship with the outcome variable (e.g., SAT score). This way, the hypothesis test is set up such that \(\beta_i\) is either related to \(Y\) or it is not.

Once we calculate this test statistic, we can calculate a p-value and decide whether to reject the null hypothesis that our regressor is not related to the outcome variable.
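As a sketch of how these pieces fit together in R (using `lm()`’s residuals for \(s^2\)), we can compute the standard error, t-statistic, and p-value for \(\beta_1\) by hand and compare them to `summary()`’s output:

```r
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
fit <- lm(sat_sum ~ hs_gpa, data = df)

# Residual variance s^2 (n - 2 degrees of freedom: intercept + slope)
n <- nrow(df)
s2 <- sum(resid(fit)^2) / (n - 2)

# Standard error of beta1, t-statistic (null of 0), two-sided p-value
se_b1 <- sqrt(s2 / sum((df$hs_gpa - mean(df$hs_gpa))^2))
t_b1  <- unname(coef(fit)["hs_gpa"]) / se_b1
p_b1  <- 2 * pt(-abs(t_b1), df = n - 2)

# Compare with R's own calculations
summary(fit)$coefficients["hs_gpa", ]
```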

As a final note about inference (i.e., hypothesis tests), we need to make two important assumptions about our errors (\(\hat{\epsilon}_i\)) in order for the above t-statistic to be valid.

  1. Normality: we assume that our errors are independent, random draws from a normal distribution.
  2. Homoskedasticity: this big word means that the normal distribution we draw our errors from has a constant variance across all points. See below for examples of homoskedastic errors and heteroskedastic errors.
Plot

One graph of homoskedastic errors next to another graph of heteroskedastic errors.
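To see the difference yourself, here is a small simulation (synthetic data, not from the course files): homoskedastic errors keep a constant spread across \(x\), while heteroskedastic errors fan out as \(x\) grows:

```r
set.seed(311)
x <- runif(300, 0, 10)
eps_homo   <- rnorm(300, sd = 2)        # constant variance
eps_hetero <- rnorm(300, sd = 0.5 * x)  # variance grows with x

par(mfrow = c(1, 2))
plot(x, eps_homo, pch = 19, las = 1,
     main = "Homoskedastic", ylab = "Error")
plot(x, eps_hetero, pch = 19, las = 1,
     main = "Heteroskedastic", ylab = "Error")
```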

Goodness of Fit

Once we estimate \(\beta_0\) and \(\beta_1\), we can also begin to talk about how well the model “fits” the data. In other words, what fraction of \(Y\)’s variance is explained by \(X\).

Before getting too much further with this, I want to be very clear that goodness-of-fit measures do not determine if your model is good. You may be able to explain a lot of \(Y\)’s variation, and still have a bad model. In addition, your model might have poor overall explanatory power, but still be a perfect explainer of what you’re interested in. We will discuss this more as the course progresses.

Below are three equations for the Total Sum of Squares (SST), the Explained Sum of Squares (SSE), and the Residual Sum of Squares (SSR).

\[\text{SST} = \sum (Y_i - \overline{Y})^2\]

\[\text{SSE} = \sum (\hat{Y}_i - \overline{Y})^2\]

\[\text{SSR} = \sum \hat{\epsilon}_i^2 = \sum (Y_i - \hat{Y}_i)^2\]

Once we have defined these three measures, we can connect all three by the following equation:

\[SST = SSE + SSR\]

If our model is very good at explaining the variation in \(Y\), we would expect SSE to be very large relative to SSR. For a perfect model, we would see \(SST = SSE\). Therefore, we can create a ratio between these two numbers, which will give us a percentage of the variation that is explained by the model. We call this ratio \(R^2\), and define it as follows:

\[R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\]
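These three sums, and the resulting \(R^2\), are easy to compute directly; the ratio should match the `r.squared` that `summary()` reports:

```r
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
fit <- lm(sat_sum ~ hs_gpa, data = df)

y     <- df$sat_sum
y_hat <- fitted(fit)

sst <- sum((y - mean(y))^2)       # total sum of squares
sse <- sum((y_hat - mean(y))^2)   # explained sum of squares
ssr <- sum((y - y_hat)^2)         # residual sum of squares

c(SST = sst, `SSE + SSR` = sse + ssr)  # these two agree
sse / sst                              # equals summary(fit)$r.squared
```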

Footnotes

  1. I am saying you would adjust downwards because the average student, according to the data we used, has a GPA of 3.2. A student with an average GPA will likely correspond to a student with an average SAT score. Therefore, since we have a positive correlation between the two variables, a below average GPA will correspond to a below average SAT score.↩︎

  2. If you’re interested in more of this, you can find some derivations on stackexchange.↩︎

  3. In the notes, we used zero, but the hypothesized value could have been anything.↩︎